Abstract

Sample compression schemes were defined by Littlestone and Warmuth (1986) 
as an abstraction of the structure underlying many learning algorithms. Roughly 
speaking, a sample compression scheme of size k means that given an arbitrary list 
of labeled examples, one can retain only k of them in a way that allows one to recover the
labels of all other examples in the list. They showed that compression implies PAC
learnability for binary-labeled classes, and asked whether the other direction holds. 
We answer their question and show that every concept class C with VC dimension 
d has a sample compression scheme of size exponential in d. The proof uses an 
approximate minimax phenomenon for binary matrices of low VC dimension, which 
may be of interest in the context of game theory. 


1 Introduction 

Learning and compression are known to be deeply related to each other. Learning
procedures perform compression, and compression is evidence of, and is useful in, learning.
For example, support vector machines, which are commonly applied to solve
classification problems, perform compression (see Chapter 6 in [6]). Another example is the use
of compression to boost the accuracy of learning procedures (see [23, 11] and Chapter 4
in [ 3]).

About thirty years ago, Littlestone and Warmuth [23] provided a mathematical
framework for studying compression in the context of learning theory. In a nutshell, they showed
that compression indeed implies learnability and asked whether learnability implies
compression.

* Departments of Computer Science, Technion-IIT, Israel and Max Planck Institute for Informatics,
Saarbrücken, Germany, shaymrn@cs.technion.ac.il. Research is supported by ISF and BSF.

† Department of Mathematics, Technion-IIT, Israel, amir.yehudayoff@gmail.com. Horev fellow,
supported by the Taub foundation. Research is also supported by ISF and BSF.





1.1 Learning 

Here we provide a brief description of standard learning terminology. For more
information, see the books [18, 13, 6].

Imagine a student who wishes to learn a concept $c : X \to \{0,1\}$ by observing some
training examples. In order to eliminate measurability issues, we focus on the case that
$X$ is a finite or countable set (although the arguments we use are more general). The high
level goal of the student is to come up with a hypothesis $h : X \to \{0,1\}$ that is close
to the unknown concept $c$ using the least number of training examples. There are many
possible ways to formally define the student’s objective. An important one is Valiant’s
probably approximately correct (PAC) learning model [34], which is closely related to an
earlier work of Vapnik and Chervonenkis [35]. This model is defined as follows.

The training examples are modeled as a pair $(Y, y)$ where $Y \subseteq X$ is the multiset of
points the student observes and $y = c|_Y$ gives their labels according to $c$. The collection of
all possible training examples is defined as follows. Let $C \subseteq \{0,1\}^X$ be a concept class. A
$C$-labeled sample is a pair $(Y, y)$, where $Y \subseteq X$ is a multiset and $y = c|_Y$ for some $c \in C$.
The size of a labeled sample $(Y, y)$ is the size of $Y$ as a multiset. For an integer $k$, denote
by $L_C(k)$ the set of $C$-labeled samples of size at most $k$. Denote by $L_C(\infty)$ the set of all
$C$-labeled samples of finite size.

The concept class $C$ is PAC learnable with $d$ samples, generalization error $\epsilon$, and
probability of success $1 - \delta$ if there is a learning map $H : L_C(d) \to \{0,1\}^X$ so that the
hypothesis $H$ generates is accurate with high probability. Formally, for every $c \in C$ and
for every probability distribution $\mu$ on $X$,

$$\Pr_{\mu^d}\Big[\big\{Y \in X^d : \mu(\{x \in X : h_Y(x) \neq c(x)\}) \le \epsilon\big\}\Big] \ge 1 - \delta,$$

where $h_Y = H(Y, c|_Y)$. In this text, when the parameters $\epsilon, \delta$ are not explicitly stated
we mean that their value is 1/3. If the image of $H$ is contained in $C$, we say that $C$ is
properly PAC learnable.

A fundamental question that emerges is characterizing the sample complexity of PAC
learning. The work of Blumer, Ehrenfeucht, Haussler, and Warmuth [ l], which is based
on [35], provides such a characterization. The characterization is based on the
Vapnik-Chervonenkis (VC) dimension of $C$, which is defined as follows. A set $Y \subseteq X$ is
$C$-shattered if for every $Z \subseteq Y$ there is $c \in C$ so that $c(x) = 1$ for all $x \in Z$ and $c(x) = 0$
for all $x \in Y - Z$. The VC dimension of $C$, denoted $\mathrm{VC}(C)$, is the maximum size of a
$C$-shattered set (it may be infinite). They proved that the sample complexity of PAC
learning $C$ is $\mathrm{VC}(C)$, up to constant factors.¹
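
For a finite class over a finite domain, the definition of shattering can be checked directly. The following is a minimal brute-force sketch, not an efficient algorithm; the function names `is_shattered` and `vc_dimension` and the representation of concepts as Python dictionaries are illustrative assumptions, not notation from this text.

```python
from itertools import combinations

def is_shattered(concepts, points):
    """Return True if `points` is shattered by `concepts`.

    Each concept is a mapping from domain elements to 0/1.
    """
    patterns = {tuple(c[x] for x in points) for c in concepts}
    return len(patterns) == 2 ** len(points)

def vc_dimension(concepts, domain):
    """Brute-force VC dimension of a finite class over a finite domain."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(concepts, pts) for pts in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

# Thresholds on {0, 1, 2, 3}: c_theta(x) = 1 iff x >= theta.
domain = [0, 1, 2, 3]
thresholds = [{x: int(x >= theta) for x in domain} for theta in range(5)]
print(vc_dimension(thresholds, domain))
```

Thresholds shatter every single point but no pair, since the pattern "1 on the smaller point, 0 on the larger" never occurs.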

Theorem 1.1 (Sample complexity of PAC learning [35, 4]). If $C \subseteq \{0,1\}^X$ has VC
dimension $d$, then $C$ is properly PAC learnable with $O((d \log(2/\epsilon) + \log(2/\delta))/\epsilon)$ samples,
generalization error $\epsilon$ and success probability $1 - \delta$.

¹ Big $O$ and $\Omega$ notation means up to absolute constants.


1.2 Compression 

Littlestone and Warmuth [23] defined sample compression schemes as a natural
abstraction that captures a common property of many learning procedures, like procedures for
learning geometric shapes or algebraic structures (see also [9, 10]).


Definition. A sample compression scheme takes a long list of samples and compresses
it to a short sub-list of samples in a way that allows one to invert the compression. Formally,
a sample compression scheme for $C$ with kernel size $k$ and side information $I$, where $I$ is
a finite set, consists of two maps $\kappa, \rho$ for which the following hold:

($\kappa$) The compression map

$$\kappa : L_C(\infty) \to L_C(k) \times I$$

takes $(Y, y)$ to $((Z, z), i)$ with $Z \subseteq Y$ and $z = y|_Z$.

($\rho$) The reconstruction map

$$\rho : L_C(k) \times I \to \{0,1\}^X$$

is so that for all $(Y, y)$ in $L_C(\infty)$,

$$\rho(\kappa(Y, y))|_Y = y.$$

The size of the scheme is $k + \log(|I|)$.² In the language of coding theory, the side
information $I$ can be thought of as list decoding; the map $\rho$ has a short list of possible
reconstructions of a given $(Z, z)$, and the information $i \in I$ indicates which element in
the list is the correct one. See [9, 10, 25] for more discussions of this definition, and some
insightful examples.
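
To make the definition concrete, here is a minimal sketch of a scheme for the class of thresholds on the real line, $c_\theta(x) = 1$ if and only if $x \ge \theta$. This class is a standard textbook illustration and is not taken from this text; the names `compress` and `reconstruct` stand in for $\kappa$ and $\rho$. The kernel has size at most 1 and no side information is needed.

```python
def compress(sample):
    """kappa: keep the smallest positively labeled point, if any.

    `sample` is a list of (x, label) pairs labeled by some threshold
    c_theta(x) = 1 iff x >= theta.
    """
    positives = [x for x, label in sample if label == 1]
    return [(min(positives), 1)] if positives else []

def reconstruct(kernel):
    """rho: turn the kernel back into a hypothesis on the whole domain."""
    if not kernel:
        return lambda x: 0          # no positive example was seen
    (z, _), = kernel
    return lambda x: int(x >= z)

theta = 2.5
sample = [(x, int(x >= theta)) for x in [4.0, 0.5, 3.0, 1.0, 3.0, 7.2]]
h = reconstruct(compress(sample))
print(all(h(x) == label for x, label in sample))
```

The reconstructed hypothesis need not equal $c_\theta$ outside the sample; the definition only asks that it agree with the labels on the sample itself.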

Motivation and background. Littlestone and Warmuth showed that every
compression scheme yields a natural learning procedure: Given a labeled sample $(Y, y)$, the learner
compresses it to $\kappa(Y, y)$ and outputs the hypothesis $h = \rho(\kappa(Y, y))$. They proved that
this is indeed a PAC learner.

Theorem 1.2 (Compression implies learnability [23]). Let $C \subseteq \{0,1\}^X$, and let $\kappa, \rho$ be
a sample compression scheme for $C$ of size $k$. Let $d \ge 8(k \log(2/\epsilon) + \log(1/\delta))/\epsilon$. Then,
the learning map $H : L_C(d) \to \{0,1\}^X$ defined by $H(Y, y) = \rho(\kappa(Y, y))$ is PAC learning
$C$ with $d$ samples, generalization error $\epsilon$ and success probability $1 - \delta$.
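
As a quick arithmetic illustration, the sample bound of Theorem 1.2 can be evaluated numerically. The sketch below assumes the bound reads $d \ge 8(k \log(2/\epsilon) + \log(1/\delta))/\epsilon$ with base 2 logarithms; the function name `compression_sample_bound` is hypothetical.

```python
import math

def compression_sample_bound(k, eps, delta):
    """Smallest integer d with d >= 8 * (k*log2(2/eps) + log2(1/delta)) / eps."""
    return math.ceil(8 * (k * math.log2(2 / eps) + math.log2(1 / delta)) / eps)

# A scheme of size 1 with eps = delta = 1/2: 8 * (2 + 1) / (1/2) = 48.
print(compression_sample_bound(1, 0.5, 0.5))
```

The bound grows linearly in the scheme size $k$ and, up to the logarithmic factor, linearly in $1/\epsilon$.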

² Logarithms in this text are base 2.
Proof sketch. Let $\mu$ be a distribution on $X$, and $x_1, \ldots, x_d$ be $d$ independent samples
from $\mu$. There are $\sum_{j=0}^{k} \binom{d}{j}$ subsets $T$ of $[d]$ of size at most $k$. There are $|I|$ choices for
information $i \in I$. Every fixing of $T, i$ yields a random function $h_{T,i} = \rho((T, c|_T), i)$ that
is measurable with respect to $x_T = (x_t : t \in T)$. The random function $h_{T,i}$ is independent
of $x_{[d]-T}$. For every fixed $T, i, x_T$, therefore, if $\mu(\{x \in X : h_{T,i}(x) \neq c(x)\}) > \epsilon$ then the
probability that $h_{T,i}$ agrees with $c$ on all samples in $[d] - T$ is less than $(1 - \epsilon)^{d - |T|}$. The
function $h$ is one of the functions in the random set $\{h_{T,i} : |T| \le k, i \in I\}$, and it satisfies
$h|_Y = c|_Y$. The union bound completes the proof. □

Littlestone and Warmuth also asked whether the other direction holds: “Are there 
concept classes with finite dimension for which there is no scheme with bounded kernel 
size and bounded additional information?” 

Further motivation for considering compression schemes comes from the problem of
boosting a weak learner to a strong learner. Boosting is a central theme in learning
theory that was initiated by Kearns and Valiant [16, 1 ]. The boosting question, roughly
speaking, is: given a learning algorithm with generalization error 0.49, can we use it to
get an algorithm with generalization error $\epsilon$ of our choice? Theorem 1.2 implies that if
the learning algorithm yields a sample compression scheme, then boosting follows with
a multiplicative overhead of roughly $1/\epsilon$ in the sample size. In other words, efficient
compression schemes immediately yield boosting.

Schapire [32] and later on Freund [11] solved the boosting problem, and showed how
to efficiently boost the generalization error of PAC learners. They showed that if $C$ is
PAC learnable with $d$ samples and generalization error 0.49, then $C$ is PAC learnable
with $O(d \log^2(d/\epsilon)/\epsilon)$ samples and generalization error $\epsilon$ (see e.g. Corollary 3.3 in [ ]).
Interestingly, their boosting is based on a weak type of compression. They showed how to
compress a sample of size $m$ to a sample of size roughly $d \log m$, and that such compression
already implies boosting (see Section 1.3 below for more details).

Additional motivation for studying sample compression schemes relates to feature
selection, which is about identifying meaningful features of the underlying domain that
are sufficient for learning purposes (see e.g. [ ]). The existence of efficient compression
schemes, loosely speaking, shows that in any arbitrarily big data there is a small set of
features that already contains all the relevant information. More concretely, a construction
of an efficient compression scheme provides tools that may be helpful for feature selection.

Previous constructions. Littlestone and Warmuth’s question and variants of it led
to a rich body of work that revealed profound properties of VC dimension and learning.
Floyd and Warmuth [9, 10] constructed sample compression schemes of size $\log |C|$ for
every finite concept class $C$. They also constructed optimal compression schemes of size
$d$ for maximum classes³ of VC dimension $d$, as a first step towards solving the general
question. As the study of sample compression schemes deepened, many insightful and
optimal schemes for special cases have been constructed: Floyd [9], Helmbold et al. [15],
Floyd and Warmuth [10], Ben-David and Litman [3], Chernikov and Simon [5], Kuzmin
and Warmuth [19], Rubinstein et al. [29], Rubinstein and Rubinstein [30], Livni and
Simon [24] and more. These works discovered and utilized connections between sample
compression schemes and model theory, topology, combinatorics, and geometry. Finally,
in our recent work with Shpilka and Wigderson [25], we constructed sample compression
schemes of size roughly $2^{O(d)} \cdot \log \log |C|$ for every finite concept class $C$ of VC dimension
$d$.

³ That is, $C \subseteq \{0,1\}^X$ of size $|C| = \sum_{j=0}^{d} \binom{|X|}{j}$ with $d = \mathrm{VC}(C)$.

1.3 Our contribution 

Our main theorem states that VC classes have sample compression schemes of finite size.
The key property of this compression is that its size does not depend on the size of the
given sample $(Y, y)$.

Theorem 1.3 (Compression). If $C \subseteq \{0,1\}^X$ has VC dimension $d$, then $C$ has a sample
compression scheme of size $2^{O(d)}$.

Our construction (see Section 3) of sample compression schemes is overall quite short
and simple. It is inspired by Freund’s work [ ] where majority is used to boost the
accuracy of learning procedures. It also uses several known properties of PAC
learnability and VC dimension, together with von Neumann’s minimax theorem, and it reveals
approximate but efficient equilibrium strategies for zero-sum games of low VC dimension
(see Section 2 below).

The construction is even more efficient when the dual class is also under control. The
dual concept class $C^* \subseteq \{0,1\}^C$ of $C$ is defined as the set of all functions $f_x : C \to \{0,1\}$
defined by $f_x(c) = c(x)$. If we think of $C$ as a binary matrix whose rows are concepts
in $C$ and columns are elements of $X$, then $C^*$ corresponds to the distinct rows of the
transposed matrix.

Theorem 1.4 (Compression using dual VC dimension). If $C \subseteq \{0,1\}^X$ has VC dimension
$d > 0$ and $C^*$ has VC dimension $d^* > 0$, then $C$ has a sample compression scheme of size
$k \log k$ with $k = O(d^* \cdot d)$.

Theorem 1.3 follows from Theorem 1.4 via the following bound, which was observed
by Assouad [ ].

Claim 1.5 (Dual VC dimension [1]). If $\mathrm{VC}(C) \le d$, then $\mathrm{VC}(C^*) < 2^{d+1}$.
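
The matrix view of the dual class lends itself to a small sanity check of Claim 1.5 on a toy example. The sketch below is a brute-force illustration only; the helper names `vc_dimension` and `dual_class`, and the encoding of a class as a list of 0/1 tuples (rows are concepts, columns are points), are assumptions made for the example.

```python
from itertools import combinations

def vc_dimension(rows):
    """Brute-force VC dimension of a class given as rows of a 0/1 matrix."""
    if not rows:
        return 0
    n = len(rows[0])
    best = 0
    for size in range(1, n + 1):
        shattered = any(
            len({tuple(r[j] for j in cols) for r in rows}) == 2 ** size
            for cols in combinations(range(n), size)
        )
        if not shattered:
            break
        best = size
    return best

def dual_class(rows):
    """Distinct rows of the transposed matrix: one function f_x per column x."""
    return sorted(set(zip(*rows)))

# Rows are concepts, columns are points: all subsets of size <= 1 of 4 points.
C = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
d, d_star = vc_dimension(C), vc_dimension(dual_class(C))
print(d, d_star, d_star < 2 ** (d + 1))
```

In this example both dimensions are small, so the exponential gap allowed by the claim is far from tight.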

A natural example for which the dual class is well behaved is geometrically defined
classes. Assume, for example, that $C$ represents the incidence relation among halfspaces
and points in $r$-dimensional real space (a.k.a. sign rank or Dudley dimension $r$). That is,
for every $c \in C$ there is a vector $a_c \in \mathbb{R}^r$ and for every $x \in X$ there is a vector $b_x \in \mathbb{R}^r$
so that $c(x) = 1$ if and only if the inner product $\langle a_c, b_x \rangle = \sum_{j=1}^{r} a_c(j) b_x(j)$ is positive. It
follows that $\mathrm{VC}(C) \le r$, but the symmetric structure also implies that $\mathrm{VC}(C^*) \le r$. So,
the compression scheme constructed here for this $C$ actually has size $O(r^2 \log r)$ and not
$2^{O(r)}$.

Proof background and overview. Freund [ ] and later on Freund and Schapire [ ]
showed that for every class $C$ that is PAC learnable with $d$ samples, there exists a
compression scheme that compresses a $C$-labeled sample $(Y, y)$ of size $m$ to a sub-sample of
size $k = O(d \log m)$ with additional information of $k \log k$ bits (for a more detailed
discussion, see Sections 1.2 and 13.1.5 in [ ]). Their constructive proof is iterative: In each
iteration $t$, a distribution $\mu_t$ on $Y$ is carefully and adaptively chosen. Then, $d$ independent
points from $Y$ are drawn according to $\mu_t$, and fed into the learning map to produce a
hypothesis $h_t$. They showed that after $T = O(\log(1/\epsilon))$ iterations, the majority vote $h$
over $h_1, \ldots, h_T$ is an $\epsilon$-approximation of $y$ with respect to the uniform measure on $Y$. In
particular, if we choose $\epsilon < 1/m$, then $h$ completely agrees with $y$ on $Y$. This makes
$T = O(\log m)$ and gives a sample compression scheme from a sample of size $m$ to a
sub-sample of size $d \cdot T = O(d \log m)$.

The size of Freund and Schapire’s compression scheme is not uniformly bounded;
it depends on $|Y|$. A first step towards removing this dependence is observing that
their proof can be replaced by a combination of von Neumann’s minimax theorem and a
Chernoff bound. In this argument, the $\log m$ factor eventually comes from a union bound
over the $m$ samples. The compression scheme presented in this text replaces the union
bound with a more accurate analysis that utilizes the VC dimension of the dual class.
This analysis ultimately replaces the $\log m$ factor by a $d^*$ factor.


2 Preliminaries 


Approximations. The following theorem shows that every distribution can be
approximated by a distribution of small support, when the statistical tests belong to a class of
small VC dimension. This phenomenon was first proved by Vapnik and Chervonenkis [35],
and was later quantitatively improved in [20, 33].

Theorem 2.1 (Approximations for bounded VC dimension [35, 20, 33]). Let $C \subseteq \{0,1\}^X$
of VC dimension $d$. Let $\mu$ be a distribution on $X$. For all $\epsilon > 0$, there exists a multiset
$Y \subseteq X$ of size $|Y| \le O(d/\epsilon^2)$ such that for all $c \in C$,

$$\left| \mu(\{x \in X : c(x) = 1\}) - \frac{|\{x \in Y : c(x) = 1\}|}{|Y|} \right| \le \epsilon.$$

Carathéodory’s theorem. The following simple lemma can be thought of as an
approximate and combinatorial version of Carathéodory’s theorem from convex geometry.
Let $C \subseteq \{0,1\}^n \subset \mathbb{R}^n$ and denote by $K$ the convex hull of $C$ in $\mathbb{R}^n$. Carathéodory’s
theorem says that every point $p \in K$ is a convex combination of at most $n + 1$ points
from $C$. The lemma says that if $\mathrm{VC}(C^*)$ is small then every $p \in K$ can be approximated
by a convex combination with a small support.

Lemma 2.2 (Sampling for dual VC dimension). Let $C \subseteq \{0,1\}^X$ and let $d^* = \mathrm{VC}(C^*)$.
Let $p$ be a distribution on $C$ and let $\epsilon > 0$. Then, $p$ can be $\epsilon$-approximated in $L^\infty$ by an
average of at most $O(d^*/\epsilon^2)$ points from $C$. That is, there is a multiset $F \subseteq C$ of size
$|F| \le O(d^*/\epsilon^2)$ so that for every $x \in X$,

$$\left| p(\{c \in C : c(x) = 1\}) - \frac{|\{f \in F : f(x) = 1\}|}{|F|} \right| \le \epsilon.$$


Proof. Every $x \in X$ corresponds to a concept in $C^*$. The distribution $p$ is a distribution
on the domain of the functions in $C^*$. The lemma follows by Theorem 2.1 applied to
$C^*$. □
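
The quantity bounded in Lemma 2.2 is easy to evaluate for a given multiset. The sketch below computes the $L^\infty$ error exactly with rational arithmetic; it only measures the error of a candidate $F$ and does not construct one. The function name `linf_error` and the encoding of concepts as 0/1 tuples are assumptions made for the example.

```python
from fractions import Fraction

def linf_error(weights, concepts, multiset):
    """max over points x of |p({c : c(x)=1}) - |{f in F : f(x)=1}| / |F||."""
    n = len(concepts[0])
    worst = Fraction(0)
    for x in range(n):
        under_p = sum(w for w, c in zip(weights, concepts) if c[x] == 1)
        under_f = Fraction(sum(f[x] for f in multiset), len(multiset))
        worst = max(worst, abs(under_p - under_f))
    return worst

concepts = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

exact = [concepts[0], concepts[0], concepts[1], concepts[2]]
coarse = [concepts[0], concepts[1]]
print(linf_error(p, concepts, exact), linf_error(p, concepts, coarse))
```

The first multiset reproduces $p$ exactly, while the second, smaller one is only a 1/4-approximation.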


Minimax. Von Neumann’s minimax theorem [27] is a seminal result in game theory
(see e.g. the textbook [28]). Assume that there are 2 players⁴, a row player and a column
player. A pure strategy of the row player is $r \in [m]$ and a pure strategy of the column
player is $j \in [n]$. A mixed strategy is a distribution on pure strategies. Let $M$ be a binary
matrix so that $M(r, j) = 1$ if and only if the row player wins the game when the pure
strategies $r, j$ are played.

The minimax theorem says that if for every mixed strategy $q$ of the column player,
there is a mixed strategy $p$ of the row player that guarantees that the row player wins
with probability at least $V$, then there is a mixed strategy $p^*$ of the row player so that for
all mixed strategies $q$ of the column player, the row player wins with probability at least
$V$. A similar statement holds for the column player. This implies that there is a pair
of mixed strategies $p^*, q^*$ that form a Nash equilibrium for the zero-sum game that $M$ defines
(see [28]).

Theorem 2.3 (Minimax [27]). Let $M \in \mathbb{R}^{m \times n}$ be a real matrix. Then,

$$\min_{p \in \Delta^m} \max_{q \in \Delta^n} p^t M q = \max_{q \in \Delta^n} \min_{p \in \Delta^m} p^t M q,$$

where $\Delta^d$ is the set of distributions on $[d]$.
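
The equality in Theorem 2.3 can be checked numerically on a tiny game. The sketch below handles only $2 \times 2$ matrices and searches mixed strategies on a finite grid, so in general it yields an approximation; for the matching pennies matrix used here the grid happens to contain the optimal strategies. The helper names `payoff`, `pure` and `grid` are illustrative.

```python
from fractions import Fraction

def payoff(p, M, q):
    """p^t M q for mixed strategies p (rows) and q (columns)."""
    return sum(p[i] * M[i][j] * q[j] for i in range(len(p)) for j in range(len(q)))

def pure(n, j):
    """The pure strategy j, written as a distribution on n strategies."""
    return [Fraction(int(i == j)) for i in range(n)]

def grid(steps):
    """Mixed strategies (a, 1 - a) over two pure strategies, on a grid."""
    return [[Fraction(a, steps), 1 - Fraction(a, steps)] for a in range(steps + 1)]

M = [[1, 0], [0, 1]]  # matching pennies
# For a fixed p the inner max over q is attained at a pure strategy, and vice versa.
min_max = min(max(payoff(p, M, pure(2, j)) for j in range(2)) for p in grid(100))
max_min = max(min(payoff(pure(2, i), M, q) for i in range(2)) for q in grid(100))
print(min_max, max_min)
```

Both sides evaluate to the value of the game, attained by the uniform mixed strategies.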

The arguments in the proof of Theorem 1.4 below imply the following variant of the
minimax theorem, which may be of interest in the context of game theory. The minimax
theorem holds for a general matrix $M$. In other words, there is no assumption on the set
of winning/losing states in the game.

⁴ We focus on the case of zero-sum games.

We observe that a combinatorial restriction on the winning/losing states in the game
implies that there is an approximate efficient equilibrium state. Namely, if the rows of
$M$ have VC dimension $d$ and the columns of $M$ have VC dimension $d^*$, then for every
$\epsilon > 0$, there is a multiset of $O(d^*/\epsilon^2)$ pure strategies $R \subseteq [m]$ for the row player, and
a multiset of $O(d/\epsilon^2)$ pure strategies $J \subseteq [n]$ for the column player, so that a uniformly
random choice from $R, J$ guarantees the players a gain that is $\epsilon$-close to the gain in the
equilibrium strategy. Such a pair of mixed strategies is called an $\epsilon$-Nash equilibrium.
Lipton and Young [22] showed that in every zero-sum game there are $\epsilon$-Nash equilibria
with logarithmic support.⁵ The ideas presented here show that if, say, the rows of the
matrix of the game have constant VC dimension, then there are $\epsilon$-Nash equilibria with
constant support.

3 A sample compression scheme 

We start with a high level description of the compression process (Theorem 1.4). Given a
sample of the form $(Y, y)$, the compression identifies $T \le O(d^*)$ subsets $Z_1, \ldots, Z_T$ of $Y$,
each of size at most $d$. It then compresses $(Y, y)$ to $(Z, z)$ with $Z = \bigcup_{t \in [T]} Z_t$ and $z = y|_Z$.
The additional information $i \in I$ allows one to recover $Z_1, \ldots, Z_T$ from $Z$. The reconstruction
process uses the information $i \in I$ to recover $Z_1, \ldots, Z_T$ from $Z$, and then uses the PAC
learning map $H$ to generate $T$ hypotheses $h_1, \ldots, h_T$ defined as $h_t = H(Z_t, z|_{Z_t})$. The
final reconstruction hypothesis $h = \rho((Z, z), i)$ is the majority vote over $h_1, \ldots, h_T$.

Proof of Theorem 1.4. Since the VC dimension of $C$ is $d$, by Theorem 1.1, there is $s =
O(d)$ and a proper learning map $H : L_C(s) \to C$ so that for every $c \in C$ and for
every probability distribution $q$ on $X$, there is $Z \subseteq \mathrm{supp}(q)$ of size $|Z| \le s$ so that
$q(\{x \in X : h_Z(x) \neq c(x)\}) \le 1/3$ where $h_Z = H(Z, c|_Z)$.

Compression. Let $(Y, y) \in L_C(\infty)$. Let

$$\mathcal{H} = \mathcal{H}_{Y,y} = \{H(Z, z) : Z \subseteq Y, |Z| \le s, z = y|_Z\} \subseteq C.$$

The compression is based on the following claim.

Claim 3.1. There are $T \le O(d^*)$ sets $Z_1, Z_2, \ldots, Z_T \subseteq Y$, each of size at most $s$, so that
the following holds. For $t \in [T]$, let

$$f_t = H(Z_t, y|_{Z_t}). \tag{1}$$

Then, for every $x \in Y$,

$$|\{t \in [T] : f_t(x) = y(x)\}| > T/2. \tag{2}$$

⁵ Lipton, Markakis and Mehta [21] proved a similar statement for general games.

Given the claim, the compression $\kappa(Y, y)$ is defined as

$$Z = \bigcup_{t \in [T]} Z_t \quad \text{and} \quad z = y|_Z.$$

The additional information $i \in I$ allows one to recover the sets $Z_1, \ldots, Z_T$ from the set $Z$.
There are many possible ways to encode this information, but the size of $I$ can be chosen
to be at most $k^k$ with $k = 1 + O(d^*) \cdot s \le O(d^* \cdot d)$.

Proof of Claim 3.1. By choice of $H$, for every distribution $q$ on $Y$, there is $h \in \mathcal{H}$ so that

$$q(\{x \in Y : h(x) = y(x)\}) \ge 2/3.$$

By Theorem 2.3, there is a distribution $p$ on $\mathcal{H}$ such that for every $x \in Y$,

$$p(\{h \in \mathcal{H} : h(x) = y(x)\}) \ge 2/3.$$

By Lemma 2.2 applied to $\mathcal{H}$ and $p$ with $\epsilon = 1/8$, there is a multiset $F = \{f_1, f_2, \ldots, f_T\} \subseteq
\mathcal{H}$ of size $T \le O(d^*)$ so that for every $x \in Y$,

$$\frac{|\{t \in [T] : f_t(x) = y(x)\}|}{T} \ge p(\{h \in \mathcal{H} : h(x) = y(x)\}) - 1/8 > 1/2.$$

For every $t \in [T]$, let $Z_t$ be a subset of $Y$ of size $|Z_t| \le d$ so that

$$H(Z_t, y|_{Z_t}) = f_t.$$

□


Reconstruction. Given $((Z, z), i)$, the information $i$ is interpreted as a list of $T$ subsets
$Z_1, \ldots, Z_T$ of $Z$, each of size at most $d$. For $t \in [T]$, let

$$h_t = H(Z_t, z|_{Z_t}).$$

Define $h = \rho((Z, z), i)$ as follows: For every $x \in X$, let $h(x)$ be a symbol in $\{0,1\}$ that
appears most in the list

$$\Lambda_x((Z, z), i) = (h_1(x), h_2(x), \ldots, h_T(x)),$$

where ties are arbitrarily broken.

Correctness. Fix $(Y, y) \in L_C(\infty)$. Let $((Z, z), i) = \kappa(Y, y)$ and $h = \rho((Z, z), i)$. For
$x \in Y$, consider the list

$$\Phi_x(Y, y) = (f_1(x), f_2(x), \ldots, f_T(x))$$

defined in the compression process of $(Y, y)$. The list $\Phi_x(Y, y)$ is identical to the list
$\Lambda_x((Z, z), i)$ due to the following three reasons: Equation (1); the information $i$ allows one to
correctly recover $Z_1, \ldots, Z_T$; and $y|_{Z_t} = z|_{Z_t}$ for all $t \in [T]$. Finally, by (2), for every
$x \in Y$, the symbol $y(x)$ appears in more than half of the list $\Lambda_x((Z, z), i)$ so indeed
$h(x) = y(x)$. □
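
The compression and reconstruction maps can be exercised end to end on a toy class. The sketch below is not the construction of the proof: instead of the minimax theorem and Lemma 2.2 it finds the sets $Z_1, \ldots, Z_T$ by brute-force search, which is feasible only for tiny samples. It also assumes that the points of $Y$ are distinct and that the proper learner returns the first consistent concept in a fixed list. All function names are hypothetical.

```python
from itertools import combinations, combinations_with_replacement

def learner(concepts, labeled):
    """H: return the first concept consistent with the labeled points."""
    return next(c for c in concepts if all(c[x] == b for x, b in labeled))

def majority(hyps, x):
    """Most common value among h(x); ties go to 0 (an arbitrary choice)."""
    return int(2 * sum(h[x] for h in hyps) > len(hyps))

def compress(concepts, Y, y, s, max_T=5):
    """kappa: brute-force search for Z_1..Z_T (each of size <= s) whose
    hypotheses agree with y on a strict majority at every point of Y."""
    label = dict(zip(Y, y))
    subsets = [Z for r in range(s + 1) for Z in combinations(Y, r)]
    for T in range(1, max_T + 1):
        for Zs in combinations_with_replacement(subsets, T):
            hyps = [learner(concepts, [(x, label[x]) for x in Z]) for Z in Zs]
            if all(2 * sum(h[x] == label[x] for h in hyps) > T for x in Y):
                Z = sorted(set().union(*Zs))
                kernel = [(x, label[x]) for x in Z]
                # Side information: each Z_t as positions inside the kernel.
                side_info = [tuple(Z.index(x) for x in Zt) for Zt in Zs]
                return kernel, side_info
    return None  # the toy search gave up within max_T votes

def reconstruct(concepts, kernel, side_info):
    """rho: recover Z_1..Z_T from the kernel, rerun H, take a majority vote."""
    hyps = [learner(concepts, [kernel[j] for j in idx]) for idx in side_info]
    n = len(concepts[0])
    return tuple(majority(hyps, x) for x in range(n))

# Singletons and the empty set over 4 points; the sample is labeled by the empty set.
concepts = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (0, 0, 0, 0)]
Y, y = [0, 1, 2, 3], [0, 0, 0, 0]
kernel, side_info = compress(concepts, Y, y, s=2)
h = reconstruct(concepts, kernel, side_info)
print(kernel, side_info, h)
```

In this example no single hypothesis produced by the learner from at most two points is correct on all of $Y$, yet a majority vote over three of them is, and only two of the four sample points need to be retained.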

4 Concluding remarks and questions 

We have shown that every VC class admits a sample compression scheme with size
exponential in its VC dimension. This is the first bound that depends only on the VC
dimension, and holds for all binary-labeled classes. It is worth noting that many of the
known compression schemes for special cases, like [10, 3, 19, 30, 24], have size $d$ or $O(d)$
which is essentially optimal. In many of these cases, our construction is in fact of size
polynomial in $d$, since the VC dimension of the dual class is small as well. Nevertheless,
Floyd and Warmuth’s question [10, 36] whether sample compression schemes of size $O(d)$
always exist remains open.

Multi-labeled classes. Unlike VC dimension, sample compression schemes as well as
the fact that they imply PAC learnability naturally generalize to multi-labeled concept
classes (see e.g. [ ]). Littlestone and Warmuth’s question is therefore an instance of
a more general question: Does the size of an optimal sample compression scheme for
a given class capture the sample complexity of PAC learning of this class? A positive
answer to this question will yield a universal and natural parameter that captures the
sample complexity of PAC learning.

There are many generalizations of VC dimension to multi-labeled concept classes $C \subseteq
\Sigma^X$, see [ ] and references within. An example that naturally comes up in our analysis
is the distinguishing dimension $\mathrm{DD}(C)$: For every $c \in C$, define a binary concept class
$B_c \subseteq \{0,1\}^X$ as the set of all $b_h$, for $h \in C$, defined by $b_h(x) = 1$ if and only if $h(x) = c(x)$.
Define

$$\mathrm{DD}(C) = \sup\{\mathrm{VC}(B_c) : c \in C\}.$$

If $C$ is binary then $\mathrm{VC}(C) = \mathrm{DD}(C)$. This definition of dimension is similar to notions
used in [26, 8, 2]. It can be verified that if $C$ is multi-labeled then our compression scheme
for $C$ has size exponential in $\mathrm{DD}(C)$. However, although $\Omega(\mathrm{VC}(C))$ is a lower bound
on the sample complexity of PAC learning for a binary-labeled $C$, the distinguishing
dimension $\mathrm{DD}(C)$ is not a lower bound on the sample complexity of PAC learning for
a multi-labeled $C$. Indeed, an example constructed by Daniely and Shalev-Shwartz [ ]
implies that there is a concept class $C \subseteq \Sigma^X$ that is properly PAC learnable with $O(1)$
samples but $\mathrm{DD}(C) \ge \Omega(\log |\Sigma|)$.
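
For small finite classes the distinguishing dimension can be computed directly from its definition. The sketch below is a brute-force illustration; the function names and the encoding of multi-labeled concepts as tuples of labels are assumptions made for the example.

```python
from itertools import combinations

def vc_dimension(rows):
    """Brute-force VC dimension of a class given as rows of a 0/1 matrix."""
    if not rows:
        return 0
    n = len(rows[0])
    best = 0
    for size in range(1, n + 1):
        shattered = any(
            len({tuple(r[j] for j in cols) for r in rows}) == 2 ** size
            for cols in combinations(range(n), size)
        )
        if not shattered:
            break
        best = size
    return best

def distinguishing_dimension(concepts):
    """DD(C) = max over c in C of VC(B_c), for a finite class C."""
    best = 0
    for c in concepts:
        # b_h(x) = 1 iff h(x) = c(x), for every h in the class.
        B_c = {tuple(int(h[x] == c[x]) for x in range(len(c))) for h in concepts}
        best = max(best, vc_dimension(sorted(B_c)))
    return best

constants = [(0, 0), (1, 1), (2, 2)]                      # constant maps into {0,1,2}
all_maps = [(a, b) for a in range(3) for b in range(3)]   # every map into {0,1,2}
print(distinguishing_dimension(constants), distinguishing_dimension(all_maps))
```

For a binary class the map $h \mapsto b_h$ only flips the coordinates where $c$ is 0, which does not change which sets are shattered; this is why the two dimensions coincide in that case.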

Learners’ complexity. The efficiency of our construction relies on the fact that every
binary-labeled concept class $C$ has a proper learner with optimal sample complexity. A
closer look at the proof reveals that it is valid even if the learner is not proper; it suffices
that the set of hypotheses produced by the learner have low VC dimension.

This motivates the following natural question: Is it true that for every learning map
$H$ for $C \subseteq \{0,1\}^X$ with $\mathrm{VC}(C) = d$ and for every $c \in C$, the set of hypotheses that $H$
outputs when learning $c$ has VC dimension $O(d)$ as well?

The answer is negative; some students learn although they make things more
complicated than necessary. Here is an example. Let $n$ be a power of 2, and consider the
concept class $C = \{(00 \ldots 0)\} \subseteq \{0,1\}^X$ with $X = [n + 3 \log n]$ consisting only of the all
zero concept. The learning map $H$ gets as input a labeled sample $(Y, y) \in L_C(3)$ of size 3,
and outputs the following hypothesis $h$. If $Y \not\subseteq [n]$ then $h$ is defined to be 0 everywhere.
Otherwise, $h$ is defined as 0 on $[n]$ and on the last $3 \log n$ coordinates $h$ is defined as $\psi(Y)$,
where $\psi$ is a bijection from $[n]^3$ to $\{0,1\}^{[3 \log n]}$. First, the image of $H$ has VC dimension
$3 \log n$ since the last $3 \log n$ coordinates are shattered by it. Second, the map $H$ is a PAC
learner for $C$. Indeed, let $\mu$ be a distribution on $X$. If $\mu([n]) > 2/3$ then the error of $h$ is
always smaller than 1/3. If $\mu([n]) \le 2/3$ then the only case that $h$ has positive error is
that $Y \subseteq [n]$, which happens with probability at most $(2/3)^3 < 1/3$.

A variation of the question above is: Does every multi-labeled class $C$ have a learner
$H$ that uses a nearly optimal number of samples with an image that is not much more
complicated than $C$?

The answer for binary-labeled classes is affirmative; $C$ has a nearly optimal proper
learner. Daniely and Shalev-Shwartz [7] showed that there are multi-labeled concept
classes that are PAC learnable with $O(1)$ samples but are not properly PAC learnable
with $O(1)$ samples. In their example, however, the image of $H$ has just one more concept
than $C$. This question therefore remains open.